# Load required packages (assumed to be installed; used throughout this analysis)
library(ggplot2)   # boxplots
library(heatmaply) # interactive correlation heatmaps
library(glmnet)    # LASSO and elastic net models
# Load data
load("../data/Residen.RData")
Data <- Residen
# Extract variables by category
time_vars <- colnames(Data)[1:4] # Time variables (START YEAR, START QUARTER, etc.)
project_vars <- paste0("V", 1:8) # Project physical/financial variables (V1-V8)
economic_lag_vars <- list( # Define economic lag groups (Lag1-Lag5)
Lag1 = paste0("V", 9:27), # V9-V27
Lag2 = paste0("V", 28:46), # V28-V46
Lag3 = paste0("V", 47:65), # V47-V65
Lag4 = paste0("V", 66:84), # V66-V84
Lag5 = paste0("V", 85:103)) # V85-V103
output_vars <- c("V104", "V105") # Output variables (sales price and construction costs)
# Define a Reusable Function for Correlation Heatmaps
plot_cor_heatmap <- function(data, vars, title = "Correlation Heatmap",
k_col = 2, k_row = 2,
fontsize_row = 8, fontsize_col = 8,
width = 800, height = 600) {
# Define colour gradient: red-white-blue
custom_palette <- colorRampPalette(c(
"#FF0000", # Red (for -1)
"#FFFFFF", # White (for 0)
"#00008B" # Dark blue (for 1.0)
))(50)
# Extract subset of variables
selected_vars <- data[, vars, drop = FALSE]
cor_matrix <- cor(selected_vars, use = "complete.obs")
# Plot interactive heatmap
heatmaply(
cor_matrix,
colors = custom_palette, # Apply custom palette (heatmaply's argument is 'colors')
limits = c(-1, 1), # Force colour range to [-1,1]
k_col = k_col,
k_row = k_row,
fontsize_row = fontsize_row,
fontsize_col = fontsize_col,
main = title,
show_dendrogram = c(TRUE, TRUE), # Show row/column dendrograms
width = width,
height = height
)
}
1. Time Variables vs. Outputs
# Analyse Sales Price (V104) by starting year
ggplot(Data, aes(x = as.factor(`START YEAR`), y = V104)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Sales Price by Starting Year", x = "Year", y = "Sales Price (V104)")
# Analyse Sales Price (V104) by starting quarter
ggplot(Data, aes(x = as.factor(`START QUARTER`), y = V104)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Sales Price by Starting Quarter", x = "Quarter", y = "Sales Price (V104)")
# Analyse Construction Costs (V105) by starting year
ggplot(Data, aes(x = as.factor(`START YEAR`), y = V105)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Construction Costs by Starting Year", x = "Year", y = "Construction Costs (V105)")
# Analyse Construction Costs (V105) by starting quarter
ggplot(Data, aes(x = as.factor(`START QUARTER`), y = V105)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Construction Costs by Starting Quarter", x = "Quarter", y = "Construction Costs (V105)")
cor_time_output <- plot_cor_heatmap(
data = Data,
vars = c(time_vars, output_vars),
title = "Correlation: Time Variables vs. Outputs")
cor_time_output
Time Variables vs. Output Variables – Observations
From the heatmap and boxplots, several structural and temporal patterns emerge within the dataset. These include consistent relationships between project milestones, cost estimates, and pricing variables, as well as clear associations that reflect the influence of time and sequence in the construction process. The observed groupings offer insights into how planning variables, cost projections, and market-facing indicators interact over the lifecycle of residential projects. The specifics are:
The correlation between V100 and V101 is extremely high (r = 0.988), indicating that project start and completion years are nearly linearly aligned. This pattern reflects the stability of construction durations across projects, where scheduling norms and timelines result in tightly coupled milestone years.
2. Project Variables vs. Outputs
cor_project_output <- plot_cor_heatmap(
data = Data,
vars = c(project_vars, output_vars),
title = "Correlation: Project Variables vs. Outputs")
cor_project_output
Project Variables vs. Output Variables – Observations
Based on the correlation heatmap and the variables described, here are some patterns we found:
# Step 1: Save all heatmaply objects into a list
heatmap_list <- lapply(names(economic_lag_vars), function(lag_name) {
current_vars <- c(economic_lag_vars[[lag_name]], output_vars)
plot_cor_heatmap(
data = Data,
vars = current_vars,
title = paste("Economic Variables:", lag_name),
k_col = 3,
k_row = 3
)
})
# Step 2: Show them one by one
heatmap_list[[1]]
heatmap_list[[2]]
heatmap_list[[3]]
heatmap_list[[4]]
heatmap_list[[5]]
Economic Variables vs. Output Variables – Observations
A long-term negative relationship with economic indicators is explained by interest rate effects, captured by the variables representing “the number of loans extended by banks” in:
This indicates a strong inverse relationship between banking loan volume and downstream economic activity. The consistency across all five lags suggests that tighter lending conditions constrain real estate market activity, including building permits, floor area expansions, and sales prices. Such correlations align with macroeconomic theory: as borrowing costs increase, construction and investment activity tend to decelerate, hindering economic growth.
Across all five time periods, a group of variables:
show high correlations (0.88 to 0.999) with:
As these are consistent across the five time lags, the relationships are likely strong and reliable. The ten economic indicators move in lockstep and indicate a positive influence on construction costs and sales prices, independent of time period. This stability across all time lags points to a potentially fundamental and reliable model of the economic drivers of construction and real estate markets.
The dendrogram and accompanying correlation heatmap reveal two distinct clusters of closely related variables:
These variables cluster tightly, reflecting a strong positive correlation, and are visually confirmed by a deep blue hue in the heatmap. This relationship is intuitive, as a higher number of building permits typically entails a greater cumulative floor area.
These variables also appear in close proximity, with similarly strong positive correlation (deep blue), suggesting that more populous cities tend to receive more bank loans—likely due to greater economic activity and housing demand.
In contrast, both pairs exhibit weaker correlations with other variables, as indicated by lighter blue shades elsewhere in the heatmap. This clustering structure underscores two coherent real-world mechanisms:
lr_model <- lm(V104 ~ . - V105, data = Data)
#summary(lr_model)
html_summary(lr_model)
Call:
lm(formula = V104 ~ . - V105, data = Data)
Residuals:
Min 1Q Median 3Q Max
-901.15 -52.29 -1.50 45.71 645.31
Coefficients: (32 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.004e+05 1.999e+05 0.502 0.615781
`START YEAR` -1.614e+03 3.388e+03 -0.476 0.634175
`START QUARTER` -4.928e+02 2.457e+03 -0.201 0.841139
`COMPLETION YEAR` 1.521e+02 1.689e+01 9.007 < 2e-16 ***
`COMPLETION QUARTER` 5.984e+01 8.881e+00 6.738 8.36e-11 ***
V1 -4.779e+00 2.225e+00 -2.148 0.032529 *
V2 6.630e-02 2.195e-02 3.020 0.002746 **
V3 -2.331e-01 6.451e-02 -3.612 0.000356 ***
V4 6.442e-03 3.570e-02 0.180 0.856905
V5 -6.548e-01 3.342e-01 -1.960 0.050975 .
V6 8.764e-02 6.322e-02 1.386 0.166727
V7 NA NA NA NA
V8 1.203e+00 1.681e-02 71.579 < 2e-16 ***
V9 2.016e-01 1.103e+00 0.183 0.855159
V10 1.373e+02 9.675e+02 0.142 0.887245
V11 1.048e+01 7.426e+01 0.141 0.887830
V12 -1.626e+02 6.837e+02 -0.238 0.812200
V13 1.329e-04 1.125e-01 0.001 0.999058
V14 1.799e-01 4.339e-01 0.415 0.678643
V15 -2.265e+01 5.008e+01 -0.452 0.651433
V16 8.696e-01 2.753e+00 0.316 0.752315
V17 -4.247e-03 1.860e-01 -0.023 0.981801
V18 2.328e+02 1.235e+03 0.188 0.850672
V19 -2.007e-01 4.647e+00 -0.043 0.965587
V20 -3.910e-01 1.187e+00 -0.329 0.742147
V21 6.422e-02 3.565e-01 0.180 0.857174
V22 3.682e-03 4.524e-01 0.008 0.993511
V23 6.413e+00 6.128e+02 0.010 0.991658
V24 4.270e+01 2.347e+02 0.182 0.855747
V25 5.267e-02 1.033e-01 0.510 0.610369
V26 -4.675e-04 4.268e-01 -0.001 0.999127
V27 9.455e-04 9.654e-03 0.098 0.922046
V28 5.915e-02 1.795e-01 0.330 0.741930
V29 -1.144e+02 8.626e+02 -0.133 0.894615
V30 -7.247e+00 1.452e+02 -0.050 0.960240
V31 -9.085e+01 3.036e+02 -0.299 0.764953
V32 -1.780e-03 8.287e-02 -0.021 0.982879
V33 -5.666e-02 3.707e-01 -0.153 0.878630
V34 -9.607e+00 1.077e+02 -0.089 0.929008
V35 9.323e-01 1.976e+00 0.472 0.637449
V36 1.490e-02 8.094e-02 0.184 0.854034
V37 7.421e+01 6.343e+02 0.117 0.906942
V38 -4.162e-01 6.246e+00 -0.067 0.946913
V39 6.458e-01 2.013e+00 0.321 0.748563
V40 -8.375e-02 1.877e-01 -0.446 0.655784
V41 1.615e-01 5.671e-01 0.285 0.776052
V42 1.890e+02 6.380e+02 0.296 0.767239
V43 -1.891e+02 8.982e+02 -0.211 0.833373
V44 -1.202e-02 5.531e-01 -0.022 0.982673
V45 -1.184e-02 3.554e-01 -0.033 0.973441
V46 -1.038e-03 8.216e-03 -0.126 0.899575
V47 1.040e-01 2.590e-01 0.402 0.688186
V48 1.119e+02 1.334e+03 0.084 0.933224
V49 -3.006e+01 4.980e+01 -0.604 0.546571
V50 -1.746e+02 4.440e+02 -0.393 0.694395
V51 6.526e-03 1.409e-01 0.046 0.963102
V52 -7.985e-02 4.933e-01 -0.162 0.871519
V53 6.439e+00 1.296e+02 0.050 0.960422
V54 2.405e+00 4.657e+00 0.516 0.605918
V55 -1.698e-02 1.631e-01 -0.104 0.917133
V56 3.089e+01 2.974e+02 0.104 0.917333
V57 -2.722e-01 2.233e+00 -0.122 0.903072
V58 3.539e-01 3.079e+00 0.115 0.908564
V59 -9.660e-02 5.540e-01 -0.174 0.861684
V60 -2.140e-01 1.307e+00 -0.164 0.870070
V61 4.198e+01 1.636e+02 0.257 0.797647
V62 2.040e+02 1.364e+03 0.150 0.881170
V63 -1.273e-01 8.479e-01 -0.150 0.880716
V64 -2.178e-02 3.772e-01 -0.058 0.953994
V65 -9.490e-04 8.072e-03 -0.118 0.906483
V66 6.260e-02 6.990e-01 0.090 0.928700
V67 -2.018e+02 7.758e+02 -0.260 0.794924
V68 6.947e+01 7.047e+01 0.986 0.324982
V69 1.248e+02 1.397e+02 0.894 0.372099
V70 -9.328e-03 4.437e-02 -0.210 0.833629
V71 -1.346e-01 4.248e-01 -0.317 0.751583
V72 -1.492e+01 2.247e+02 -0.066 0.947118
V73 NA NA NA NA
V74 NA NA NA NA
V75 NA NA NA NA
V76 NA NA NA NA
V77 NA NA NA NA
V78 NA NA NA NA
V79 NA NA NA NA
V80 NA NA NA NA
V81 NA NA NA NA
V82 NA NA NA NA
V83 NA NA NA NA
V84 NA NA NA NA
V85 NA NA NA NA
V86 NA NA NA NA
V87 NA NA NA NA
V88 NA NA NA NA
V89 NA NA NA NA
V90 NA NA NA NA
V91 NA NA NA NA
V92 NA NA NA NA
V93 NA NA NA NA
V94 NA NA NA NA
V95 NA NA NA NA
V96 NA NA NA NA
V97 NA NA NA NA
V98 NA NA NA NA
V99 NA NA NA NA
V100 NA NA NA NA
V101 NA NA NA NA
V102 NA NA NA NA
V103 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 148.6 on 296 degrees of freedom
Multiple R-squared: 0.9879, Adjusted R-squared: 0.9848
F-statistic: 321.9 on 75 and 296 DF, p-value: < 2.2e-16
The linear regression model predicting actual sales price (V104) offers several key insights regarding model quality, predictor significance, and structural challenges. The following evaluation addresses how well the model fits the data, the relative influence of selected variables, the distribution of residual errors, and any underlying multicollinearity that may affect interpretability or reliability. These diagnostic elements are critical in assessing both the explanatory power and the robustness of the fitted model. From the summary of the linear regression model, we can find that:
1. Residuals: Residuals are roughly symmetric around zero (median = -1.50), indicating unbiased predictions on average. However, they range widely (-901.15 to 645.31), suggesting the model struggles with extreme values.
2. Coefficients and Significance: COMPLETION YEAR (β = 152.1, p < 0.001) and COMPLETION QUARTER (β = 59.84, p < 0.001) both have significant positive effects, indicating that the completion year and quarter are crucial predictors of the outcome. V3 (β = -0.2331, p < 0.001), the lot area, has a significant negative effect, while V8 (β = 1.203, p < 0.001), the price per m² of the unit at the beginning of the project, has a significant positive effect. At the 0.05 level, V1 (β = -4.779), the project locality, has a significant negative effect, while V2 (β = 0.0663), the total floor area of the buildings, has a significant positive effect. V5 (β = -0.6548, p < 0.1), the preliminary estimated construction cost based on prices at the beginning of the project, has a marginally significant negative effect. The linear regression model supports the findings from the earlier data visualisation analysis.
3. Model Performance: The adjusted R-squared of 0.9848 indicates that the model explains about 98.48% of the variation in actual sales prices (V104), so the model fits the data well. However, such a high R-squared obtained with 75 predictors may harm generalisability, so the risk of overfitting needs careful attention. The F-statistic of 321.9 (p-value < 2.2e-16) indicates that the model is statistically significant overall.
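The 32 coefficients reported as "not defined because of singularities" are predictors that are perfectly collinear with others in the model. A minimal sketch of how lm() handles such aliasing, using synthetic data (the names x1, x2 and y are hypothetical, not columns of the Residen dataset):

```r
# Synthetic illustration of aliased predictors: x2 is an exact linear
# function of x1, so lm() drops it and reports NA for its coefficient.
set.seed(42)
df <- data.frame(x1 = rnorm(100))
df$x2 <- 2 * df$x1              # perfectly collinear with x1
df$y  <- df$x1 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = df)
coef(fit)["x2"]                 # NA: dropped due to singularity
alias(fit)                      # reports the x2 = 2 * x1 dependency
```

The alias() function is a convenient way to list exactly which terms were dropped and why, before deciding whether to remove them from the model formula.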
Step 1: Data Preparation
# Load the Parkinson's dataset
parkinsons_data <- read.csv("../data/parkinsons.csv", header = TRUE)[-1]
# Function to split and scale data
prepare_data <- function(data, seed) {
set.seed(seed)
n <- nrow(data)
train_indices <- sample(1:n, 30)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
X_train <- as.matrix(train_data[, paste0("X", 1:97)])
y_train <- train_data$UPDRS
X_test <- as.matrix(test_data[, paste0("X", 1:97)])
y_test <- test_data$UPDRS
X_train_scaled <- scale(X_train)
X_test_scaled <- scale(X_test,
center = attr(X_train_scaled, "scaled:center"),
scale = attr(X_train_scaled, "scaled:scale"))
return(list(
X_train = X_train_scaled,
y_train = y_train,
X_test = X_test_scaled,
y_test = y_test
))
}
lm_data <- prepare_data(parkinsons_data, 123)
# Fit a linear model
lm_model <- lm(lm_data$y_train ~ lm_data$X_train)
#summary(lm_model)
html_summary(lm_model, title = "trial")
Call:
lm(formula = lm_data$y_train ~ lm_data$X_train)
Residuals:
ALL 30 residuals are 0: no residual degrees of freedom!
Coefficients: (68 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.14 NaN NaN NaN
lm_data$X_trainX1 -555.71 NaN NaN NaN
lm_data$X_trainX2 -34.59 NaN NaN NaN
lm_data$X_trainX3 585.96 NaN NaN NaN
lm_data$X_trainX4 -153.57 NaN NaN NaN
lm_data$X_trainX5 -599.82 NaN NaN NaN
lm_data$X_trainX6 -24.87 NaN NaN NaN
lm_data$X_trainX7 51.83 NaN NaN NaN
lm_data$X_trainX8 170.76 NaN NaN NaN
lm_data$X_trainX9 -225.87 NaN NaN NaN
lm_data$X_trainX10 28.81 NaN NaN NaN
lm_data$X_trainX11 173.40 NaN NaN NaN
lm_data$X_trainX12 -86.58 NaN NaN NaN
lm_data$X_trainX13 31994.20 NaN NaN NaN
lm_data$X_trainX14 6745.51 NaN NaN NaN
lm_data$X_trainX15 1379.28 NaN NaN NaN
lm_data$X_trainX16 -32628.15 NaN NaN NaN
lm_data$X_trainX17 -3149.37 NaN NaN NaN
lm_data$X_trainX18 188.29 NaN NaN NaN
lm_data$X_trainX19 713.14 NaN NaN NaN
lm_data$X_trainX20 -239.04 NaN NaN NaN
lm_data$X_trainX21 -283.17 NaN NaN NaN
lm_data$X_trainX22 278.35 NaN NaN NaN
lm_data$X_trainX23 489.80 NaN NaN NaN
lm_data$X_trainX24 -125.69 NaN NaN NaN
lm_data$X_trainX25 -32067.51 NaN NaN NaN
lm_data$X_trainX26 -6759.54 NaN NaN NaN
lm_data$X_trainX27 -1427.30 NaN NaN NaN
lm_data$X_trainX28 32511.82 NaN NaN NaN
lm_data$X_trainX29 3023.98 NaN NaN NaN
lm_data$X_trainX30 NA NA NA NA
lm_data$X_trainX31 NA NA NA NA
lm_data$X_trainX32 NA NA NA NA
lm_data$X_trainX33 NA NA NA NA
lm_data$X_trainX34 NA NA NA NA
lm_data$X_trainX35 NA NA NA NA
lm_data$X_trainX36 NA NA NA NA
lm_data$X_trainX37 NA NA NA NA
lm_data$X_trainX38 NA NA NA NA
lm_data$X_trainX39 NA NA NA NA
lm_data$X_trainX40 NA NA NA NA
lm_data$X_trainX41 NA NA NA NA
lm_data$X_trainX42 NA NA NA NA
lm_data$X_trainX43 NA NA NA NA
lm_data$X_trainX44 NA NA NA NA
lm_data$X_trainX45 NA NA NA NA
lm_data$X_trainX46 NA NA NA NA
lm_data$X_trainX47 NA NA NA NA
lm_data$X_trainX48 NA NA NA NA
lm_data$X_trainX49 NA NA NA NA
lm_data$X_trainX50 NA NA NA NA
lm_data$X_trainX51 NA NA NA NA
lm_data$X_trainX52 NA NA NA NA
lm_data$X_trainX53 NA NA NA NA
lm_data$X_trainX54 NA NA NA NA
lm_data$X_trainX55 NA NA NA NA
lm_data$X_trainX56 NA NA NA NA
lm_data$X_trainX57 NA NA NA NA
lm_data$X_trainX58 NA NA NA NA
lm_data$X_trainX59 NA NA NA NA
lm_data$X_trainX60 NA NA NA NA
lm_data$X_trainX61 NA NA NA NA
lm_data$X_trainX62 NA NA NA NA
lm_data$X_trainX63 NA NA NA NA
lm_data$X_trainX64 NA NA NA NA
lm_data$X_trainX65 NA NA NA NA
lm_data$X_trainX66 NA NA NA NA
lm_data$X_trainX67 NA NA NA NA
lm_data$X_trainX68 NA NA NA NA
lm_data$X_trainX69 NA NA NA NA
lm_data$X_trainX70 NA NA NA NA
lm_data$X_trainX71 NA NA NA NA
lm_data$X_trainX72 NA NA NA NA
lm_data$X_trainX73 NA NA NA NA
lm_data$X_trainX74 NA NA NA NA
lm_data$X_trainX75 NA NA NA NA
lm_data$X_trainX76 NA NA NA NA
lm_data$X_trainX77 NA NA NA NA
lm_data$X_trainX78 NA NA NA NA
lm_data$X_trainX79 NA NA NA NA
lm_data$X_trainX80 NA NA NA NA
lm_data$X_trainX81 NA NA NA NA
lm_data$X_trainX82 NA NA NA NA
lm_data$X_trainX83 NA NA NA NA
lm_data$X_trainX84 NA NA NA NA
lm_data$X_trainX85 NA NA NA NA
lm_data$X_trainX86 NA NA NA NA
lm_data$X_trainX87 NA NA NA NA
lm_data$X_trainX88 NA NA NA NA
lm_data$X_trainX89 NA NA NA NA
lm_data$X_trainX90 NA NA NA NA
lm_data$X_trainX91 NA NA NA NA
lm_data$X_trainX92 NA NA NA NA
lm_data$X_trainX93 NA NA NA NA
lm_data$X_trainX94 NA NA NA NA
lm_data$X_trainX95 NA NA NA NA
lm_data$X_trainX96 NA NA NA NA
lm_data$X_trainX97 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 29 and 0 DF, p-value: NA
The R-squared value of 1 indicates a perfect fit on the training data. However, this is likely due to overfitting, as the model is using 97 predictors with only 30 observations. In such cases, the model memorises the data instead of learning meaningful relationships, which results in poor generalization to new data.
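This perfect fit can be reproduced on pure noise: with more predictors than observations, lm() interpolates the training data exactly. A small sketch under the same n = 30 setting (the data here are simulated, not the Parkinson's measurements):

```r
# With p >= n - 1, a linear model fits 30 random responses perfectly,
# even though the predictors carry no information about y.
set.seed(1)
n <- 30; p <- 97
X <- matrix(rnorm(n * p), n, p)   # pure-noise predictors
y <- rnorm(n)                     # response unrelated to X
fit <- lm(y ~ X)
summary(fit)$r.squared            # 1: the model memorises the data
```

An R-squared of 1 on noise makes the point directly: training fit alone says nothing about generalisation when p exceeds n.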
# General LASSO function to fit and return key results
lasso_regression <- function(X_train, y_train, X_test, y_test) {
# Define lambda grid for tuning
grid <- 10^seq(3, -1, length.out = 100)
# Perform Leave-One-Out Cross-Validation
lasso_cv <- cv.glmnet(X_train, y_train, alpha = 1, lambda = grid,
nfolds = nrow(X_train), thresh = 1e-10)
# Extract the best lambda value
optimal_lambda <- lasso_cv$lambda.min
# Fit final LASSO model using the best lambda
lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda = optimal_lambda)
# Predict on the test set and compute mean squared error
y_pred <- predict(lasso_model, newx = X_test)
test_error <- mean((y_test - y_pred)^2)
# Get the names of selected (non-zero) features
coef_full <- coef(lasso_model)
selected_features <- rownames(coef_full)[coef_full[, 1] != 0][-1]
# Build the final model formula as a string
formula_terms <- sprintf("%.4f * %s", coef_full[selected_features, 1], selected_features)
formula_str <- paste("UPDRS =", round(coef_full[1,1], 4), "+", paste(formula_terms, collapse = " + "))
# Return results
return(list(
lambda = optimal_lambda,
test_error = test_error,
selected_features = selected_features,
n_nonzero = length(selected_features),
formula = formula_str,
cv_plot = lasso_cv
))
}
# Prepare data and fit the model
data123 <- prepare_data(parkinsons_data, 123)
result123 <- lasso_regression(data123$X_train, data123$y_train, data123$X_test, data123$y_test)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Plot cross-validation errors vs. log(lambda)
plot(result123$cv_plot)
# Print key results
cat("Best lambda:", result123$lambda, "\n")
## Best lambda: 1.023531
cat("Test set MSE:", result123$test_error, "\n")
## Test set MSE: 20.41405
cat("Number of selected features:", result123$n_nonzero, "\n")
## Number of selected features: 4
cat("Selected features:", paste(result123$selected_features, collapse = ", "), "\n")
## Selected features: X9, X82, X83, X97
cat("LASSO model equation:\n", result123$formula, "\n")
## LASSO model equation:
## UPDRS = 26.1423 + 0.2039 * X9 + 0.0586 * X82 + 0.3937 * X83 + 7.0572 * X97
The LASSO regression identified a lambda value (≈1.02) that minimised the cross-validated MSE, yielding a model with only four non-zero coefficients: X9, X82, X83, and X97. This demonstrates strong feature selection, improving interpretability and reducing the risk of overfitting. The resulting formula is: \[ \text{UPDRS} = 26.1423 + 0.2039 \times X_9 + 0.0586 \times X_{82} + 0.3937 \times X_{83} + 7.0572 \times X_{97} \] With only four key predictors, the final model is much easier to interpret: the coefficients give direct insight into how each selected feature contributes to the UPDRS score. This sparse model also performs well (test MSE ≈ 20.4), showing a good balance between accuracy and simplicity.
# Prepare data with a different seed and fit the model
data321 <- prepare_data(parkinsons_data, 321)
result321 <- lasso_regression(data321$X_train, data321$y_train, data321$X_test, data321$y_test)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Plot cross-validation result
plot(result321$cv_plot)
# Print key results
cat("Best lambda:", result321$lambda, "\n")
## Best lambda: 0.9326033
cat("Test set MSE:", result321$test_error, "\n")
## Test set MSE: 4.899441
cat("Number of selected features:", result321$n_nonzero, "\n")
## Number of selected features: 3
cat("Selected features:", paste(result321$selected_features, collapse = ", "), "\n")
## Selected features: X83, X95, X97
cat("LASSO model equation:\n", result321$formula, "\n")
## LASSO model equation:
## UPDRS = 25.4929 + 0.2028 * X83 + 0.2533 * X95 + 9.0117 * X97
# Build a comparison table of the two models
comparison_table <- data.frame(
Seed = c(123, 321),
Optimal_Lambda = c(result123$lambda, result321$lambda),
Test_MSE = c(result123$test_error, result321$test_error),
Num_Selected_Features = c(result123$n_nonzero, result321$n_nonzero),
Selected_Features = c(
paste(result123$selected_features, collapse = ", "),
paste(result321$selected_features, collapse = ", ")
)
)
# Display the table
print(comparison_table)
## Seed Optimal_Lambda Test_MSE Num_Selected_Features Selected_Features
## 1 123 1.0235310 20.414054 4 X9, X82, X83, X97
## 2 321 0.9326033 4.899441 3 X83, X95, X97
We repeated the analysis with a different random seed (321 instead of 123), which changed the training-test split. The test-set mean squared error (MSE) was 20.414 for the first split and 4.899 for the new one. This reduction indicates that the model trained on the new split performs better on its test set.
The final model selected a slightly different set of features: under seed 123 the selected features were X9, X82, X83 and X97, while under seed 321 they were X83, X95 and X97.
While some features like X83 and X97 appeared in both models, others (e.g., X9, X82 and X95) were different. This indicates that the feature selection process is sensitive to the specific training-test split and that small changes in data partitioning can lead to different optimal subsets.
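One simple way to quantify this sensitivity is the overlap between the two selected feature sets, e.g. via the Jaccard index. A sketch using the selections reported above:

```r
# Overlap between the feature sets chosen under the two seeds.
sel_123 <- c("X9", "X82", "X83", "X97")
sel_321 <- c("X83", "X95", "X97")
common  <- intersect(sel_123, sel_321)
jaccard <- length(common) / length(union(sel_123, sel_321))
common   # "X83" "X97"
jaccard  # 0.4: only moderate agreement between the two splits
```

A Jaccard index of 0.4 confirms that while a stable core (X83, X97) persists, the remaining selections depend heavily on the particular split.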
The final model formula is: \[ \text{UPDRS} = 25.4929 + 0.2028 \times X_{83} + 0.2533 \times X_{95} + 9.0117 \times X_{97} \]
# Load the data
weather_data <- read.csv("../data/Weather_Station_Data_v1.csv")
# Set seed for reproducibility
set.seed(321)
# Split the data into training and test sets
train_indices <- sample(1:nrow(weather_data), size = 0.8 * nrow(weather_data))
train_data <- weather_data[train_indices, ]
test_data <- weather_data[-train_indices, ]
# Define predictors and response
X_train <- as.matrix(train_data[, -which(names(train_data) == "MEAN_ANNUAL_RAINFALL")])
y_train <- train_data$MEAN_ANNUAL_RAINFALL
X_test <- as.matrix(test_data[, -which(names(test_data) == "MEAN_ANNUAL_RAINFALL")])
y_test <- test_data$MEAN_ANNUAL_RAINFALL
# Define a sequence of alpha values
alphas <- seq(0, 1, by = 0.1)
# Perform cross-validation for each alpha
cv_results <- lapply(alphas, function(a) {
cv.glmnet(X_train, y_train, alpha = a, nfolds = 10)
})
# Find the best model based on minimum cross-validated error
best_alpha_idx <- which.min(sapply(cv_results, function(cv) min(cv$cvm)))
best_cv_mod <- cv_results[[best_alpha_idx]]
best_alpha <- alphas[best_alpha_idx]
cat("Best alpha selected is:", best_alpha, "\n")
## Best alpha selected is: 0.3
Based on the cross-validation, the best alpha for the elastic net on this dataset is 0.3. In the next step, alpha is therefore fixed at 0.3 to fit the final models.
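For reference, the elastic net objective minimised by glmnet blends the LASSO and ridge penalties, with alpha controlling the mix: \[ \min_{\beta_0,\, \beta} \; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left(\alpha\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\lVert\beta\rVert_2^2\right) \] Setting alpha = 1 recovers the LASSO and alpha = 0 ridge regression; the selected alpha = 0.3 therefore leans towards ridge-style shrinkage while retaining some variable selection.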
lambda_min <- best_cv_mod$lambda.min
lambda_1se <- best_cv_mod$lambda.1se
# Plot the cross-validation results
plot(best_cv_mod, main = "", xlab = "Log(lambda)", ylab = "Mean Squared Error")
title(main = "ElasticNet Cross-Validation Results", line = 2)
abline(v = log(lambda_min), col = "green", lty = 2)
abline(v = log(lambda_1se), col = "red", lty = 2)
legend("topright", legend = c("lambda.min", "lambda.1se"),
col = c("green", "red"), lty = 2, bty = "n")
cat("The value of Lambda.min is: ", lambda_min, "\n")
## The value of Lambda.min is: 0.3456271
cat("The value of Lambda.1se is: ", lambda_1se, "\n")
## The value of Lambda.1se is: 20.7198
The plot shows the mean squared error (MSE) across different values of lambda. The green line marks lambda.min (the lambda with the lowest MSE), and the red line marks lambda.1se (the largest lambda whose error is within one standard error of the minimum, giving a simpler model).
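The lambda.1se rule can be illustrated with a toy example: starting from the lambda with minimum CV error, select the largest lambda whose CV error is within one standard error of that minimum (the numbers below are invented, not from the weather data):

```r
# One-standard-error rule on made-up cross-validation results.
lambdas <- c(10, 5, 2, 1, 0.5)        # candidate lambdas (decreasing)
cvm     <- c(30, 22, 18, 17, 17.5)    # mean CV error for each lambda
cvsd    <- c(3, 2.5, 2, 2, 2)         # standard error of each CV error
i_min      <- which.min(cvm)          # index of lambda.min
threshold  <- cvm[i_min] + cvsd[i_min]
lambda_min <- lambdas[i_min]
lambda_1se <- max(lambdas[cvm <= threshold])  # simplest model within 1 SE
c(lambda_min = lambda_min, lambda_1se = lambda_1se)  # 1 and 2
```

Because larger lambdas shrink more coefficients to zero, preferring the largest lambda within one standard error trades a statistically negligible amount of error for a sparser model.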
# Predictions
pred_min <- predict(best_cv_mod, s = "lambda.min", newx = X_test)
pred_1se <- predict(best_cv_mod, s = "lambda.1se", newx = X_test)
# Calculate MSE and RMSE
mse_min <- mean((y_test - pred_min)^2)
rmse_min <- sqrt(mse_min)
mse_1se <- mean((y_test - pred_1se)^2)
rmse_1se <- sqrt(mse_1se)
# Count non-zero coefficients
coef_min <- coef(best_cv_mod, s = "lambda.min")
coef_1se <- coef(best_cv_mod, s = "lambda.1se")
n_pred_min <- sum(coef_min != 0) - 1 # exclude intercept
n_pred_1se <- sum(coef_1se != 0) - 1
# Create summary table
comparison_table <- data.frame(
Model = c("lambda.min", "lambda.1se"),
Alpha = c(best_alpha, best_alpha),
Lambda = c(lambda_min, lambda_1se),
MSE = c(mse_min, mse_1se),
RMSE = c(rmse_min, rmse_1se),
Predictors_Used = c(n_pred_min, n_pred_1se)
)
print(comparison_table)
## Model Alpha Lambda MSE RMSE Predictors_Used
## 1 lambda.min 0.3 0.3456271 11412.89 106.8311 13
## 2 lambda.1se 0.3 20.7198017 13078.75 114.3624 9
# Extract and display coefficients
coef_table_min <- data.frame(
Variable = rownames(as.matrix(coef_min)),
Coefficient = round(as.vector(coef_min), 4)
)
coef_table_1se <- data.frame(
Variable = rownames(as.matrix(coef_1se)),
Coefficient = round(as.vector(coef_1se), 4)
)
cat("Coefficients for lambda.min:\n")
## Coefficients for lambda.min:
print(coef_table_min)
## Variable Coefficient
## 1 (Intercept) 1043.8279
## 2 ALTITUDE 0.1367
## 3 MEAN_ANNUAL_AIR_TEMP -44.2016
## 4 MEAN_MONTHLY_MAX_TEMP 22.8969
## 5 MEAN_MONTHLY_MIN_TEMP -5.3594
## 6 MEAN_ANNUAL_WIND_SPEED -14.1155
## 7 MEAN_CLOUD_COVER 3.7076
## 8 MEAN_ANNUAL_SUNSHINE -0.0120
## 9 MAX_MONTHLY_WIND_SPEED -8.1163
## 10 MAX_AIR_TEMP -29.2842
## 11 MAX_WIND_SPEED -0.2383
## 12 MAX_RAINFALL 20.6347
## 13 MIN_AIR_TEMP 27.2047
## 14 MEAN_RANGE_AIR_TEMP 22.3804
cat("Coefficients for lambda.1se:\n")
## Coefficients for lambda.1se:
print(coef_table_1se)
## Variable Coefficient
## 1 (Intercept) 612.3961
## 2 ALTITUDE 0.1671
## 3 MEAN_ANNUAL_AIR_TEMP -4.9645
## 4 MEAN_MONTHLY_MAX_TEMP 0.0000
## 5 MEAN_MONTHLY_MIN_TEMP -1.0373
## 6 MEAN_ANNUAL_WIND_SPEED -8.0770
## 7 MEAN_CLOUD_COVER 1.6533
## 8 MEAN_ANNUAL_SUNSHINE -0.0031
## 9 MAX_MONTHLY_WIND_SPEED 0.0000
## 10 MAX_AIR_TEMP -16.0839
## 11 MAX_WIND_SPEED 0.0000
## 12 MAX_RAINFALL 19.1592
## 13 MIN_AIR_TEMP 9.3296
## 14 MEAN_RANGE_AIR_TEMP 0.0000
We evaluated two elastic net regression models with alpha = 0.3, one using lambda.min and one using lambda.1se. The lambda.min model had the lower MSE of 11,412.89 and RMSE of 106.83, while the lambda.1se model had a slightly higher MSE of 13,078.75 and RMSE of 114.36.
In terms of complexity, the lambda.min model used 13 predictors, whereas the lambda.1se model used only 9 predictors, indicating stronger regularization.
This shows a typical trade-off: lambda.min provides better predictive accuracy, but with more variables in the model. lambda.1se gives a more parsimonious model (fewer predictors), which may be preferred for simplicity or interpretability, even at the cost of slightly higher error.
In real-world applications, especially when working with environmental or climate-related data, interpretability and generalizability are often more important than marginal gains in accuracy. A simpler model is also easier to communicate and deploy.
Therefore, the lambda.1se model strikes a better balance between performance and model simplicity.
The formula for the final model is as follows: \[ \begin{aligned} \hat{Y} =\ & 612.3961 \\ &+ 0.1671 \cdot \text{ALTITUDE} \\ &- 4.9645 \cdot \text{MEAN\_ANNUAL\_AIR\_TEMP} \\ &- 1.0373 \cdot \text{MEAN\_MONTHLY\_MIN\_TEMP} \\ &- 8.0770 \cdot \text{MEAN\_ANNUAL\_WIND\_SPEED} \\ &+ 1.6533 \cdot \text{MEAN\_CLOUD\_COVER} \\ &- 0.0031 \cdot \text{MEAN\_ANNUAL\_SUNSHINE} \\ &- 16.0839 \cdot \text{MAX\_AIR\_TEMP} \\ &+ 19.1592 \cdot \text{MAX\_RAINFALL} \\ &+ 9.3296 \cdot \text{MIN\_AIR\_TEMP} \end{aligned} \]
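As a usage sketch, the final coefficients can be applied to a new observation directly; the station values below are invented purely for illustration:

```r
# Hand-applying the lambda.1se coefficients from the table above.
coefs <- c(`(Intercept)` = 612.3961, ALTITUDE = 0.1671,
           MEAN_ANNUAL_AIR_TEMP = -4.9645, MEAN_MONTHLY_MIN_TEMP = -1.0373,
           MEAN_ANNUAL_WIND_SPEED = -8.0770, MEAN_CLOUD_COVER = 1.6533,
           MEAN_ANNUAL_SUNSHINE = -0.0031, MAX_AIR_TEMP = -16.0839,
           MAX_RAINFALL = 19.1592, MIN_AIR_TEMP = 9.3296)
new_obs <- c(1, 100, 10, 2, 10, 60, 1400, 28, 40, -5)  # hypothetical station
pred <- sum(coefs * new_obs)
pred  # predicted mean annual rainfall for this hypothetical station
```

In practice one would use predict(best_cv_mod, s = "lambda.1se", newx = ...) on properly formatted data; the manual dot product above just makes the formula's arithmetic explicit.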